mean_tweedie_deviance (Mean Tweedie Deviance)#
mean_tweedie_deviance is a regression metric (and training loss) derived from the Tweedie family of distributions.
It is a great fit when targets are non-negative, often skewed, and may contain many zeros.
Typical examples:
insurance pricing: claim cost per policy (many zeros + positive continuous)
rainfall amounts (many zeros + positive continuous)
event counts (Poisson special case)
Learning goals#
Understand the Tweedie power parameter p and special cases (Normal / Poisson / Gamma)
Write the metric with clear math + domain constraints
Implement mean_tweedie_deviance from scratch in NumPy (incl. weights + edge cases)
Build intuition with plots: asymmetry, effect of p, and sensitivity to extreme errors
Use the deviance as a training loss: fit a simple Tweedie GLM with gradient descent
Prerequisites#
Basic calculus (derivatives)
Familiarity with regression + linear models
Helpful: GLMs and link functions (we use a log link)
2.5) Probabilistic view: deviance as (scaled) negative log-likelihood#
For Tweedie GLMs, the deviance comes from the log-likelihood. You can view it as the extra negative log-likelihood you pay by predicting \(\hat{\mu}\) instead of the (unrealistic) perfectly-fitting saturated model.
For an exponential dispersion model, the (unit) deviance can be written as:
\[d_p(y, \hat{\mu}) = 2\left( \mathcal{L}(\hat{\mu}) - \mathcal{L}(y) \right)\]
where \(\mathcal{L}(\mu)\) is the part of the negative log-likelihood that depends on \(\mu\). For \(p\notin\{0,1,2\}\) (ignoring constants that don’t depend on \(\mu\)):
\[\mathcal{L}(\mu) = -\frac{y\,\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p}\]
and the saturated model corresponds to setting \(\mu=y\) for each sample. So minimizing mean Tweedie deviance is (up to a scale) the same as maximum likelihood for the Tweedie family.
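A quick numerical illustration of this identity for a single \((y, \hat{\mu})\) pair (self-contained and purely illustrative; the values and p=1.5 are arbitrary, and the import is repeated because the main import cell comes below):
from sklearn.metrics import mean_tweedie_deviance

def L_mu(mu, y, p):
    # part of the negative log-likelihood that depends on mu (constants dropped)
    return -y * mu ** (1 - p) / (1 - p) + mu ** (2 - p) / (2 - p)

y_val, mu_val, p = 3.0, 1.7, 1.5
lhs = mean_tweedie_deviance([y_val], [mu_val], power=p)
rhs = 2 * (L_mu(mu_val, y_val, p) - L_mu(y_val, y_val, p))
print(lhs, rhs)  # the two values should agree up to floating-point error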
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) The Tweedie family (why this metric exists)#
The Tweedie family is an exponential dispersion model with a mean–variance relationship:
\[\operatorname{Var}[Y] = \phi\,\mu^{p}\]
where:
\(\mu = \mathbb{E}[Y]\) is the mean
\(\phi>0\) is a dispersion (scale) parameter
\(p\) is the power parameter: it determines the distribution type and how variance scales with the mean
In scikit-learn, mean_tweedie_deviance is defined for p <= 0 or p >= 1.
Common special cases:

| power | distribution (typical) | support |
|---|---|---|
| p = 0 | Normal | real-valued |
| p = 1 | Poisson | counts (0, 1, 2, …) |
| 1 < p < 2 | Compound Poisson–Gamma | mass at 0 + positive continuous |
| p = 2 | Gamma | positive continuous |
| p = 3 | Inverse Gaussian | positive continuous |
The main “wow” case is 1 < p < 2: it models data that is often zero but otherwise continuous and skewed.
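Before defining the metric, here is a quick empirical sanity check of the mean–variance relationship for two familiar members (sample sizes and parameters below are arbitrary): Poisson corresponds to \(p=1,\ \phi=1\), and a Gamma with shape \(k\) corresponds to \(p=2,\ \phi=1/k\).
# Empirical check of Var[Y] ≈ φ · μ^p for Poisson (p=1, φ=1) and Gamma with shape k (p=2, φ=1/k).
rng_check = np.random.default_rng(0)
mu_check = 4.0
y_pois = rng_check.poisson(lam=mu_check, size=200_000)
print(f'Poisson: mean={y_pois.mean():.3f}, var={y_pois.var():.3f} (expect var ≈ μ = {mu_check})')
k_shape = 2.0
y_gam = rng_check.gamma(shape=k_shape, scale=mu_check / k_shape, size=200_000)
print(f'Gamma  : mean={y_gam.mean():.3f}, var={y_gam.var():.3f} (expect var ≈ μ²/k = {mu_check**2 / k_shape})')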
2) Definition#
Let \(y_i\) be targets and \(\hat{\mu}_i\) be predictions (think: predicted mean). The per-sample Tweedie deviance is:
For \(p \notin \{0,1,2\}\):
\[d_p(y, \hat{\mu}) = 2\left( \frac{y^{2-p}}{(1-p)(2-p)} - \frac{y\,\hat{\mu}^{1-p}}{1-p} + \frac{\hat{\mu}^{2-p}}{2-p} \right)\]
Special cases:
\(p=0\) (Normal):
\[d_0(y, \hat{\mu}) = (y-\hat{\mu})^2\]
\(p=1\) (Poisson):
\[d_1(y, \hat{\mu}) = 2\left( y\log\frac{y}{\hat{\mu}} + \hat{\mu}-y \right),\quad 0\log 0 := 0\]
\(p=2\) (Gamma):
\[d_2(y, \hat{\mu}) = 2\left( \log\frac{\hat{\mu}}{y} + \frac{y}{\hat{\mu}} - 1 \right)\]
The scikit-learn metric is the mean deviance (optionally sample-weighted):
\[\text{mean Tweedie deviance}(y, \hat{\mu}) = \frac{1}{n}\sum_{i=1}^{n} d_p(y_i, \hat{\mu}_i)\]
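These special cases line up with scikit-learn’s dedicated metrics; a quick check (sample values are arbitrary):
from sklearn.metrics import mean_squared_error, mean_poisson_deviance, mean_gamma_deviance
y_chk = np.array([2.0, 1.0, 4.0, 3.0])
mu_chk = np.array([1.5, 0.7, 2.5, 3.4])
print(np.isclose(mean_tweedie_deviance(y_chk, mu_chk, power=0), mean_squared_error(y_chk, mu_chk)))
print(np.isclose(mean_tweedie_deviance(y_chk, mu_chk, power=1), mean_poisson_deviance(y_chk, mu_chk)))
print(np.isclose(mean_tweedie_deviance(y_chk, mu_chk, power=2), mean_gamma_deviance(y_chk, mu_chk)))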
Domain constraints (important)#
scikit-learn enforces these domains:
p < 0: \(\hat{\mu} > 0\) (targets may be any real number)
p = 0: no constraints (\(y,\hat{\mu}\in\mathbb{R}\))
1 <= p < 2: \(y \ge 0\) and \(\hat{\mu} > 0\)
p >= 2: \(y > 0\) and \(\hat{\mu} > 0\)
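scikit-learn raises a ValueError when a domain is violated; for example, power=2 (Gamma) rejects zeros in y_true:
try:
    mean_tweedie_deviance([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], power=2)
except ValueError as err:
    print(err)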
# Quick demo (using scikit-learn)
y_true = np.array([2.0, 0.0, 1.0, 4.0])
y_pred = np.array([0.5, 0.5, 2.0, 2.0])
print('power=1 (Poisson):', mean_tweedie_deviance(y_true, y_pred, power=1))
print('power=1.5 (Tweedie):', mean_tweedie_deviance(y_true, y_pred, power=1.5))
y_true_pos = np.array([2.0, 1.0, 4.0, 3.0])
y_pred_pos = np.array([1.5, 0.7, 2.5, 3.4])
print('power=0 (MSE):', mean_tweedie_deviance(y_true_pos, y_pred_pos, power=0))
print('power=2 (Gamma):', mean_tweedie_deviance(y_true_pos, y_pred_pos, power=2))
power=1 (Poisson): 1.4260151319598084
power=1.5 (Tweedie): 1.7781745930520232
power=0 (MSE): 0.6875
power=2 (Gamma): 0.12753010019949473
3) NumPy implementation (from scratch)#
We’ll implement two functions:
tweedie_deviance_per_sample_np → returns \(d_p(y_i, \hat{\mu}_i)\) for each sample
mean_tweedie_deviance_np → averages the per-sample deviances (optionally weighted)
Goal: match scikit-learn behavior, including input validation.
def _check_1d_float(x, name):
x = np.asarray(x, dtype=float)
if x.ndim != 1:
raise ValueError(f'{name} must be 1D, got shape {x.shape!r}')
return x
def _xlogy(x, y):
"""Compute x * log(y) with the convention 0 * log(y) = 0.
This avoids the RuntimeWarning you get from np.where(..., 0, x*np.log(y)).
"""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
x, y = np.broadcast_arrays(x, y)
out = np.zeros_like(x, dtype=float)
mask = x != 0
out[mask] = x[mask] * np.log(y[mask])
return out
def tweedie_deviance_per_sample_np(y_true, y_pred, *, power=0.0):
y_true = _check_1d_float(y_true, 'y_true')
y_pred = _check_1d_float(y_pred, 'y_pred')
if y_true.shape[0] != y_pred.shape[0]:
raise ValueError('y_true and y_pred must have the same length')
p = float(power)
if not (p <= 0 or p >= 1):
raise ValueError('power must be <= 0 or >= 1')
msg = f'Mean Tweedie deviance error with power={p} can only be used on '
if p < 0:
if np.any(y_pred <= 0):
raise ValueError(msg + 'strictly positive y_pred.')
elif p == 0:
pass
elif 1 <= p < 2:
if np.any(y_true < 0) or np.any(y_pred <= 0):
raise ValueError(msg + 'non-negative y and strictly positive y_pred.')
elif p >= 2:
if np.any(y_true <= 0) or np.any(y_pred <= 0):
raise ValueError(msg + 'strictly positive y and y_pred.')
if p < 0:
# Matches scikit-learn: use y_true^... only when y_true > 0.
y_pos = np.where(y_true > 0, y_true, 0.0)
dev = 2 * (
np.power(y_pos, 2 - p) / ((1 - p) * (2 - p))
- y_true * np.power(y_pred, 1 - p) / (1 - p)
+ np.power(y_pred, 2 - p) / (2 - p)
)
elif p == 0:
dev = (y_true - y_pred) ** 2
elif p == 1:
# 2 * ( y*log(y/mu) - y + mu )
dev = 2 * (_xlogy(y_true, y_true / y_pred) - y_true + y_pred)
elif p == 2:
dev = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
else:
dev = 2 * (
np.power(y_true, 2 - p) / ((1 - p) * (2 - p))
- y_true * np.power(y_pred, 1 - p) / (1 - p)
+ np.power(y_pred, 2 - p) / (2 - p)
)
return dev
def mean_tweedie_deviance_np(y_true, y_pred, *, power=0.0, sample_weight=None):
dev = tweedie_deviance_per_sample_np(y_true, y_pred, power=power)
if sample_weight is None:
return float(np.mean(dev))
w = _check_1d_float(sample_weight, 'sample_weight')
if w.shape[0] != dev.shape[0]:
raise ValueError('sample_weight must have the same length as y')
if np.any(w < 0):
raise ValueError('sample_weight must be non-negative')
if float(np.sum(w)) == 0.0:
raise ValueError('sum of sample_weight must be > 0')
return float(np.sum(w * dev) / np.sum(w))
# Verify that our implementation matches scikit-learn
for p in [0, 1, 1.5, 2, 3, -1]:
if p == 0:
y = rng.normal(size=300)
mu = rng.normal(size=300)
elif p < 2:
y = rng.poisson(lam=2.0, size=300).astype(float)
mu = np.exp(rng.normal(size=300))
else:
y = np.exp(rng.normal(size=300))
mu = np.exp(rng.normal(size=300))
w = rng.uniform(0.1, 2.0, size=y.shape[0])
sk = mean_tweedie_deviance(y, mu, power=p)
npv = mean_tweedie_deviance_np(y, mu, power=p)
sk_w = mean_tweedie_deviance(y, mu, power=p, sample_weight=w)
np_w = mean_tweedie_deviance_np(y, mu, power=p, sample_weight=w)
print(f'power={p:>4}: abs diff={abs(sk-npv):.3e}, abs diff (weighted)={abs(sk_w-np_w):.3e}')
power= 0: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power= 1: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power= 1.5: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power= 2: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power= 3: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
power= -1: abs diff=0.000e+00, abs diff (weighted)=0.000e+00
4) Intuition with plots#
Two important intuitions:
Asymmetry (for p >= 1): underpredicting a positive target is usually punished much more than overpredicting.
Power p controls extreme sensitivity: larger p tends to reduce the penalty for very large deviations.
We’ll visualize the deviance \(d_p(y, \hat{\mu})\) as a function of the multiplicative factor:
\[\hat{\mu} = f \cdot y\]
where \(f<1\) means underprediction and \(f>1\) means overprediction.
y0 = 10.0
f = np.logspace(-2, 2, 600) # multiplicative factor
mu = y0 * f
powers = [0, 1, 1.5, 2, 3]
fig = go.Figure()
for p in powers:
dev = tweedie_deviance_per_sample_np(np.full_like(mu, y0), mu, power=p)
fig.add_trace(go.Scatter(x=f, y=np.log1p(dev), mode='lines', name=f'p={p}'))
fig.update_layout(
title='Tweedie deviance vs multiplicative error (plotted as log(1 + deviance))',
xaxis_title='factor f in μ̂ = y · f (log scale)',
yaxis_title='log(1 + deviance)',
)
fig.update_xaxes(type='log')
fig.add_vline(x=1.0, line_dash='dash', line_color='rgba(0,0,0,0.4)')
fig.show()
Why underprediction can hurt a lot#
For p > 0, the loss includes terms like \(\log\hat{\mu}\) (Poisson / Gamma special cases) or powers of \(\hat{\mu}\).
This makes the deviance blow up when:
\(y>0\) and \(\hat{\mu}\to 0^+\) (severe underprediction)
In optimization, this shows up directly in the gradient.
mu_grid = np.logspace(-3, 3, 600)
y_fixed = 10.0
fig = go.Figure()
for p in [1, 1.5, 2, 3]:
# For p!=0, the derivative wrt μ is: ∂d/∂μ = 2 * μ^{-p} * (μ - y)
grad_mu = 2 * (mu_grid ** (-p)) * (mu_grid - y_fixed)
fig.add_trace(go.Scatter(x=mu_grid, y=np.abs(grad_mu), mode='lines', name=f'|∂d/∂μ|, p={p}'))
fig.update_layout(
title='Gradient magnitude vs prediction μ̂ (fixed y=10): small μ̂ can create huge gradients',
xaxis_title='μ̂ (log scale)',
yaxis_title='|∂d/∂μ| (log scale)',
)
fig.update_xaxes(type='log')
fig.update_yaxes(type='log')
fig.show()
5) Using Tweedie deviance to fit a model (low-level optimization)#
mean_tweedie_deviance is not only an evaluation metric: it is also a natural training loss for Tweedie GLMs.
We’ll fit a 1D log-link model:
\[\hat{\mu}_i = \exp(b_0 + b_1 x_i)\]
The log link guarantees \(\hat{\mu}_i>0\), which is required for p != 0.
Objective#
\[J(b_0, b_1) = \frac{1}{n}\sum_{i=1}^{n} d_p\left(y_i, \hat{\mu}_i\right),\qquad \hat{\mu}_i = \exp(b_0 + b_1 x_i)\]
Gradient (for p != 0)#
A useful identity (valid for p != 0) is:
\[\frac{\partial d_p(y, \hat{\mu})}{\partial \hat{\mu}} = 2\left( \hat{\mu}^{1-p} - y\,\hat{\mu}^{-p} \right)\]
With the log link \(\hat{\mu}=e^{\eta}\), we have \(\frac{\partial \hat{\mu}}{\partial \eta}=\hat{\mu}\), so:
\[\frac{\partial d_p}{\partial \eta} = 2\left( \hat{\mu}^{2-p} - y\,\hat{\mu}^{1-p} \right)\]
Then:
\[\frac{\partial J}{\partial b_0} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial d_p}{\partial \eta_i},\qquad \frac{\partial J}{\partial b_1} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial d_p}{\partial \eta_i}\, x_i\]
We verify these gradients numerically with a finite-difference check right after the implementation below.
We’ll use gradient descent to minimize \(J\) for a synthetic dataset that looks like compound Poisson data (1 < p < 2).
# Synthetic compound Poisson–Gamma data (mass at 0 + positive continuous)
n = 500
x = rng.uniform(-2, 2, size=n)
b0_true = 0.2
b1_true = 0.8
# True mean curve
mu_true = np.exp(b0_true + b1_true * x)
# Compound Poisson–Gamma construction:
# N ~ Poisson(lambda)
# Severity ~ Gamma(k, theta)
# Y = sum_{j=1..N} Severity_j
# Then Y has a point mass at 0 and a continuous positive tail.
severity_shape = 2.0
severity_scale = 0.5 # mean severity = shape * scale = 1.0
sev_mean = severity_shape * severity_scale
lam = mu_true / sev_mean
counts = rng.poisson(lam)
y = np.zeros(n)
mask = counts > 0
y[mask] = rng.gamma(shape=counts[mask] * severity_shape, scale=severity_scale)
fig = make_subplots(rows=1, cols=2, subplot_titles=('y vs x', 'y distribution'))
fig.add_trace(go.Scatter(x=x, y=y, mode='markers', marker=dict(size=5, opacity=0.6), name='y'), row=1, col=1)
fig.add_trace(go.Histogram(x=y, nbinsx=60, name='hist(y)'), row=1, col=2)
fig.update_xaxes(title_text='x', row=1, col=1)
fig.update_yaxes(title_text='y', row=1, col=1)
fig.update_xaxes(title_text='y', row=1, col=2)
fig.update_yaxes(title_text='count', row=1, col=2)
fig.update_layout(title='Synthetic compound Poisson-like regression target', showlegend=False)
fig.show()
def exp_clip(z, clip=20.0):
"""exp(z) with clipping to avoid overflow during training."""
return np.exp(np.clip(z, -clip, clip))
def tweedie_glm_loss_and_grad(b0, b1, x, y, *, power):
"""Return (loss, db0, db1) for μ̂ = exp(b0 + b1 x)."""
eta = b0 + b1 * x
mu = exp_clip(eta)
loss = mean_tweedie_deviance_np(y, mu, power=power)
p = float(power)
if p == 0:
raise ValueError('This helper assumes power != 0 (use MSE directly for p=0).')
# ∂d/∂η = 2 * (μ^(2-p) - y * μ^(1-p))
grad_eta = 2 * (mu ** (2 - p) - y * mu ** (1 - p))
db0 = float(np.mean(grad_eta))
db1 = float(np.mean(grad_eta * x))
return loss, db0, db1
def fit_tweedie_glm_gd(x, y, *, power, lr=0.05, n_steps=600):
b0, b1 = 0.0, 0.0
history = {
'step': [],
'b0': [],
'b1': [],
'loss': [],
}
for step in range(n_steps):
loss, db0, db1 = tweedie_glm_loss_and_grad(b0, b1, x, y, power=power)
history['step'].append(step)
history['b0'].append(b0)
history['b1'].append(b1)
history['loss'].append(loss)
b0 -= lr * db0
b1 -= lr * db1
return (b0, b1), history
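Before running the optimizer, a quick sanity check of the analytic gradient from the derivation above against a central finite-difference approximation (the evaluation point (0.3, 0.5) is arbitrary):
# Illustrative sanity check: analytic gradient vs central finite differences.
eps = 1e-6
b0_t, b1_t = 0.3, 0.5
_, db0_a, db1_a = tweedie_glm_loss_and_grad(b0_t, b1_t, x, y, power=1.5)
db0_fd = (tweedie_glm_loss_and_grad(b0_t + eps, b1_t, x, y, power=1.5)[0]
          - tweedie_glm_loss_and_grad(b0_t - eps, b1_t, x, y, power=1.5)[0]) / (2 * eps)
db1_fd = (tweedie_glm_loss_and_grad(b0_t, b1_t + eps, x, y, power=1.5)[0]
          - tweedie_glm_loss_and_grad(b0_t, b1_t - eps, x, y, power=1.5)[0]) / (2 * eps)
print(f'analytic grad: ({db0_a:.6f}, {db1_a:.6f})')
print(f'finite diff  : ({db0_fd:.6f}, {db1_fd:.6f})')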
power = 1.5
(b0_hat, b1_hat), hist = fit_tweedie_glm_gd(x, y, power=power, lr=0.05, n_steps=700)
mu_hat = exp_clip(b0_hat + b1_hat * x)
print(f'true params: b0={b0_true:.3f}, b1={b1_true:.3f}')
print(f'GD params: b0={b0_hat:.3f}, b1={b1_hat:.3f}, mean deviance={mean_tweedie_deviance_np(y, mu_hat, power=power):.4f}')
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist['step'], y=hist['loss'], mode='lines', name='mean deviance'))
fig.update_layout(
title='Gradient descent on mean Tweedie deviance',
xaxis_title='step',
yaxis_title='mean Tweedie deviance',
)
fig.show()
# Visualize fitted mean curve
x_line = np.linspace(x.min(), x.max(), 300)
mu_line_true = np.exp(b0_true + b1_true * x_line)
mu_line_hat = np.exp(b0_hat + b1_hat * x_line)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode='markers', name='data', marker=dict(size=5, opacity=0.55)))
fig.add_trace(go.Scatter(x=x_line, y=mu_line_true, mode='lines', name='true E[y|x]', line=dict(dash='dash')))
fig.add_trace(go.Scatter(x=x_line, y=mu_line_hat, mode='lines', name='GD fitted mean'))
fig.update_layout(title='Fitting the mean with Tweedie deviance (log link)', xaxis_title='x', yaxis_title='y')
fig.show()
true params: b0=0.200, b1=0.800
GD params: b0=0.192, b1=0.783, mean deviance=1.7361
# Visualize the loss landscape over (b0, b1) and the GD path
# (This is not a quadratic bowl like MSE, but is often nicely behaved for 1 <= p <= 2.)
b0_grid = np.linspace(b0_hat - 1.4, b0_hat + 1.4, 120)
b1_grid = np.linspace(b1_hat - 1.4, b1_hat + 1.4, 120)
B0, B1 = np.meshgrid(b0_grid, b1_grid)
ETA = B0[:, :, None] + B1[:, :, None] * x[None, None, :]
MU = exp_clip(ETA)
p = float(power)
# p=1.5 here (general formula)
const = np.power(y, 2 - p) / ((1 - p) * (2 - p))
DEV = 2 * (
const[None, None, :]
- y[None, None, :] * np.power(MU, 1 - p) / (1 - p)
+ np.power(MU, 2 - p) / (2 - p)
)
Z = np.mean(DEV, axis=2)
stride = max(1, len(hist['step']) // 140)
b0_path = np.array(hist['b0'])[::stride]
b1_path = np.array(hist['b1'])[::stride]
fig = go.Figure()
fig.add_trace(
go.Contour(
x=b0_grid,
y=b1_grid,
z=Z,
contours_coloring='heatmap',
colorbar=dict(title='mean deviance'),
)
)
fig.add_trace(
go.Scatter(
x=b0_path,
y=b1_path,
mode='lines+markers',
name='GD path',
marker=dict(size=4, color='black'),
line=dict(color='black'),
)
)
fig.update_layout(
title='Mean Tweedie deviance landscape + gradient descent trajectory',
xaxis_title='b0 (intercept)',
yaxis_title='b1 (slope)',
)
fig.show()
6) Practical usage (scikit-learn)#
As a metric:
from sklearn.metrics import mean_tweedie_deviance
mean_tweedie_deviance(y_true, y_pred, power=1.5)
As a model loss: scikit-learn provides TweedieRegressor, which minimizes a Tweedie deviance objective
(optionally with L2 regularization via alpha).
Choosing power is usually treated as a hyperparameter (cross-validate over a small grid like [1.1, 1.3, 1.5, 1.7, 1.9]); a minimal sketch follows at the end of this section.
X = x.reshape(-1, 1)
reg = TweedieRegressor(power=power, link='log', alpha=0.0, max_iter=5000)
reg.fit(X, y)
mu_sk = reg.predict(X)
print('sklearn intercept_:', float(reg.intercept_))
print('sklearn coef_ :', reg.coef_)
print('mean deviance (sklearn fit):', mean_tweedie_deviance(y, mu_sk, power=power))
print('mean deviance (our GD fit) :', mean_tweedie_deviance(y, exp_clip(b0_hat + b1_hat * x), power=power))
sklearn intercept_: 0.19246904943966447
sklearn coef_ : [0.7827]
mean deviance (sklearn fit): 1.7361175245303142
mean deviance (our GD fit) : 1.7361175244957865
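As suggested above, power itself can be chosen by cross-validation. A minimal sketch (the candidate grid and the fixed evaluation power of 1.5 are illustrative choices): deviances computed at different powers are not on the same scale, so each candidate is scored with the same evaluation power.
# Sketch: select the Tweedie power by cross-validation, scoring all candidates
# with one fixed evaluation power so the scores are comparable.
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

eval_power = 1.5
scorer = make_scorer(mean_tweedie_deviance, power=eval_power, greater_is_better=False)
for p_cand in [1.1, 1.3, 1.5, 1.7, 1.9]:
    model = TweedieRegressor(power=p_cand, link='log', alpha=0.0, max_iter=5000)
    scores = cross_val_score(model, X, y, scoring=scorer, cv=5)
    print(f'power={p_cand}: CV mean deviance (eval power={eval_power}) = {-scores.mean():.4f}')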
7) Pros, cons, and when to use it#
Pros#
Principled: corresponds to (scaled) negative log-likelihood for Tweedie GLMs
Handles zeros + positive continuous when 1 < p < 2 (compound Poisson)
Tunable via p: bridges MSE (p=0), Poisson deviance (p=1), Gamma deviance (p=2)
Naturally heteroscedastic: matches data where variance grows with the mean
Cons#
Requires selecting a good power parameter p (mis-specification hurts)
Domain constraints: y_pred must be strictly positive for p != 0 (and y_true must be strictly positive for p >= 2)
Less “unit-interpretable” than MAE/MSE (involves powers/logs)
Can be numerically harsh if a model predicts values near 0 while targets are positive
Good use cases#
p=1 Poisson: count data (calls per hour, clicks per day, defects per batch)
1 < p < 2 compound Poisson: claim cost, rainfall, “amount spent” with many zeros
p=2 Gamma: positive continuous targets with multiplicative noise (severity, durations, rates)
8) Common pitfalls + diagnostics#
Check your target support: if you have zeros, avoid p >= 2.
Ensure positivity of predictions for p != 0 (log link or clipping with a small epsilon).
Watch for huge contributions: plot per-sample deviances to spot underprediction near \(\hat{\mu}\approx 0\) (see the diagnostic sketch below).
Choose p deliberately: cross-validate; 1 < p < 2 is often a strong starting point for zero-heavy positive targets.
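As a concrete example of the per-sample diagnostic (reusing y, mu_hat, and power from section 5):
# Inspect the largest per-sample deviance contributions of the GD fit.
dev_per_sample = tweedie_deviance_per_sample_np(y, mu_hat, power=power)
worst = np.argsort(dev_per_sample)[::-1][:5]
for i in worst:
    print(f'y={y[i]:8.3f}  mu_hat={mu_hat[i]:8.3f}  per-sample deviance={dev_per_sample[i]:8.3f}')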
9) Exercises#
Implement d2_tweedie_score and show how it relates to the mean deviance.
On the synthetic dataset above, try p in {1.1, 1.3, 1.5, 1.7, 1.9} and compare fitted curves and final loss.
Modify the data generator to increase the fraction of zeros. Does the best p shift?
References#
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_tweedie_deviance.html
scikit-learn user guide (Tweedie regression): https://scikit-learn.org/stable/modules/linear_model.html#tweedie-regression
Jørgensen, B. (1997). The Theory of Dispersion Models.